L1 retrotransposons comprise 17% of the human genome and are its only
autonomous mobile elements. Although L1-induced insertional mutagenesis
causes Mendelian disease, their mutagenic load in cancer has been elusive.
Using L1-targeted resequencing of 16 colorectal tumor and matched normal
DNAs, we found that certain cancers were excessively mutagenized by
human-specific L1s, while no verifiable insertions were present in normal
tissues. We confirmed de novo L1 insertions in malignancy by both
validating and sequencing 69/107 tumor-specific insertions and retrieving
both 5' and 3' junctions for 35. In contrast to germline polymorphic L1s,
all insertions were severely 5' truncated. Validated insertion numbers
varied from up to 17 in some tumors to none in three others, and
correlated with the age of the patients. Numerous genes with a role in
tumorigenesis were targeted, including ODZ3, ROBO2, PTPRM, PCM1, and
CDH11. Thus, somatic retrotransposition may play an etiologic role in
colorectal cancer.



[Supplemental material is available for this article.] 

Over two-thirds of our genome may stem from "jumping genes"
(de Koning et al. 2011). Three classes of retroelements are known to
be currently active and a source of human disease: long interspersed
elements (LINEs), the prototype of which is the RNA polymerase II
transcribed L1; short interspersed elements (SINEs), consisting
essentially of RNA polymerase III transcribed Alus; and SVAs
(SINE-R/VNTR/Alus) that are intermediate in size relative to Alus and
L1s, and are likely transcribed by RNA polymerase II. A fourth class of
retroelements in our genome, human endogenous retroviruses (HERVs), is
considered immobile. Full-length L1s are not only responsible for
mobilizing themselves, but also for mobilizing the nonautonomous Alu
(Dewannieux et al. 2003) and SVA retrotransposons (Ostertag et al. 2003;
Hancks et al. 2011; Raiz et al. 2011), inactive L1s (Moran et al. 1996),
small RNAs (Gilbert et al. 2005), and classical mRNAs, thereby creating
processed pseudogenes (Esnault et al. 2000; Wei et al. 2001; Ohshima
et al. 2003).

Although there are about half a million L1s in the human genome, only
the human-specific L1s (L1Hs) are currently active, represented in each
individual by about 800 germline copies (Ewing and Kazazian 2010),
including ~200 full-length sequences (Boissinot et al. 2000). According
to conservative estimates there are only about 100 active L1Hs in any
human diploid genome that are retrotranspositionally competent, of which
six from the reference genome and 37 from six other genomes are known to
be highly active ("hot") (Brouha et al. 2003; Beck et al. 2010). L1s
retrotranspose through a process called target-primed reverse
transcription (TPRT) (Luan et al. 1993; Cost et al. 2002) with the help
of the L1-encoded proteins open reading frame 1 protein (ORF1p) and
ORF2p. Endonuclease and reverse transcriptase activities for L1
integration are provided by ORF2p (Mathias et al. 1991; Feng et al.
1996). The hallmarks of TPRT are the addition of a new poly(A) tail to
the integrated sequence and target-site duplication (TSD), usually
6-20 bp in length. A fraction of retrotransposition events are also
associated with 3' transduction, the comobilization of 3' flanking DNA
sequences (Holmes et al. 1994; Moran et al. 1999; Goodier et al. 2000;
Pickeral et al. 2000), resulting from transcriptional read-through of
the weak L1 poly(A) signal and preferential use of a stronger downstream
poly(A) signal. Most de novo L1 retrotransposition events are 5'
truncated (Gilbert et al. 2005), with one extreme truncation described
where the whole L1 sequence was missing and only the 3' transduced
sequence was present (Solyom et al. 2012).

Active mobile elements are not only a significant source of intra- and
interindividual variation, but can also act as insertional mutagens.
There are 97 known disease-associated retrotransposon insertions into
protein-coding genes (Hancks and Kazazian 2012; van der Klift et al.
2012), which is an underestimate, as conventional mutation screening
methods are not designed to amplify large insertions. Of these nearly
100 cases, 25 are caused by L1s, 60 by Alus, eight by SVAs, and four by
poly(A) sequence originating from an unidentifiable source (Hancks and
Kazazian 2012; van der Klift et al. 2012). Of these insertions, 30 occur
in cancer cases, including four in colon cancer patients (Miki et al.
1992; Su et al. 2000; Kloor et al. 2004; van der Klift et al. 2012).
While three of the four colon cancer cases involve predicted germline or
early somatic insertions, a somatic L1 insertion occurred in the APC
gene in colon cancer (Miki et al. 1992).

In addition to acting as insertional mutagens, retrotransposons can
disrupt gene function and genomic integrity in many other ways. These
include recombination-mediated gene rearrangements, genetic instability,
transcriptional interference, alternative splicing, gene breaking,
epigenetic effects, the generation of DNA double-strand breaks, and the
expression of small noncoding RNAs (for review, see Goodier and Kazazian
2008; Beck et al. 2011). All of these mechanisms are compatible with a
tumorigenic potential of these elements. Retrotransposon overdose is
another potential scenario in malignancy and could result in increased
insertional mutagenesis, toxicity, or other oncogenic effects. Indeed,
the overexpression of L1 ORF1p was observed in certain tumors
(Bratthauer and Fanning 1992; Asch et al. 1996; Su et al. 2007; Harris
et al. 2010), and RNAi-mediated silencing of L1s resulted in reduced
proliferation and differentiation of tumorigenic cell lines (Oricchio
et al. 2007). In addition, overexpression of Alu elements may cause
disease through RNA toxicity (Kaneko et al. 2011). Thus, the cell likely
has intrinsic defense mechanisms to prevent retrotransposon
overexpression, including methylation (Yoder et al. 1997; Bourc'his and
Bestor 2004) and the expression of several host proteins, such as
APOBEC3 family members (Bogerd et al. 2006; Chen et al. 2006;
Muckenfuss et al. 2006; Stenglein and Harris 2006) or DNA repair enzymes
(Gasior et al. 2006; Suzuki et al. 2009; Coufal et al. 2011).

Here we applied two high-throughput L1-targeted resequencing methods to
discover retrotransposon activity in colorectal cancers. We identified
numerous nonreference L1 insertions not present in paired normal tissue
and report a high retrotransposon insertion rate in tumors. We
characterized insertion size and TSDs in cancer tissue, confirming that
L1s primarily mobilize in cancer via TPRT. The data suggest the
importance of retrotransposition in the biology of colorectal
tumorigenesis.

Results 

L1 display through high-throughput sequencing

We applied two next-generation resequencing methods, hemi-specific PCR
coupled to Illumina sequencing (L1-seq) (Ewing and Kazazian 2010) and
retrotransposon capture sequencing (RC-seq) (Baillie et al. 2011), to
interrogate the retroelement load of colorectal tumors. Approximately
800 nonreference L1Hs copies had been located from individual blood or
lymphoblastoid cell lines by L1-seq, the same number as represented by
the hg18 reference genome assembly, indicating its capacity to recover
essentially all germline L1Hs elements (Ewing and Kazazian 2010). Here,
we applied this method to recover somatic insertions from malignant
tissues. RC-seq has previously been used to identify somatic mosaicism
associated with L1, Alu, and SVA mobilization in the brain (Baillie
et al. 2011). Its use of sequence capture for retrotransposon enrichment
contrasts with the use of PCR by L1-seq; as a result, RC-seq is expected
to cover a broader range of insertions, but with less depth per
insertion than L1-seq. A highly multiplexed version of RC-seq was
applied to assess whether somatic L1Hs insertions were identified by
both approaches.

We sequenced DNA from 16 colorectal tumors and matched normal colons
using a pooled L1-seq-based approach. The 16 tumor/normal pairs (32
samples total) were separated into four libraries of eight samples each,
denoted "colo1/tumor," "colo1/normal," "colo2/tumor," and
"colo2/normal." We sequenced one lane for each library on an Illumina
HiSeq 2000 instrument with the exception of colo1/normal, where two
lanes of data were generated. The total number of reads generated for
each library can be found in Supplemental Table S1.

Using computational methods outlined in Ewing and Kazazian (2010), we
identified clusters of reads localized 3' of predicted insertion sites.
We required 100 reads spanning at least 100 bp ("high stringency") as a
minimum for L1 detection, which yields a specificity of >90% based on
recovery of reference L1 insertions and nonreference sites discovered in
previous studies (see Supplemental Figs. S1, S2 for an exploration of
cutoff parameters). Using these criteria, we identified 764 reference
L1Hs insertion sites present in NCBI36/hg18 and 400 nonreference
insertion sites from the colo1 data. From the colo2 data we identified
816 reference and 433 nonreference insertions. Combining the data, we
found 819 reference L1Hs elements and 635 nonreference elements, 336 of
which had not been previously cataloged. Many of these uncataloged
elements are new somatic insertions in the tumor. In total, 38% of
reference and 35% of nonreference insertions were in gene annotations
based on UCSC Known Genes. The distribution of L1 insertions detected by
L1-seq in this study is shown in Figure 1.
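
The stringency cutoff described above can be pictured as a simple filter
over read clusters. The following is a minimal sketch, not the pipeline
actually used (which follows Ewing and Kazazian 2010): the `Cluster`
record, its field names, and the example values are hypothetical, and
only the two thresholds (100 reads, 100 bp span) are taken from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text ("high stringency").
MIN_READS = 100
MIN_SPAN_BP = 100


@dataclass(frozen=True)
class Cluster:
    """Hypothetical record for a cluster of reads 3' of a predicted insertion site."""
    chrom: str
    start: int    # leftmost aligned base of the cluster
    end: int      # rightmost aligned base of the cluster
    n_reads: int  # number of reads assigned to the cluster

    @property
    def span(self) -> int:
        return self.end - self.start


def passes_cutoff(c: Cluster, min_reads: int = MIN_READS,
                  min_span: int = MIN_SPAN_BP) -> bool:
    """Keep a cluster only if it has enough reads covering a wide enough span."""
    return c.n_reads >= min_reads and c.span >= min_span


# Made-up example clusters, for illustration only.
clusters = [
    Cluster("chr1", 1_000, 1_150, 240),  # passes both thresholds
    Cluster("chr2", 5_000, 5_060, 300),  # many reads, but span < 100 bp
    Cluster("chrX", 9_000, 9_200, 12),   # wide span, but too few reads
]
high_stringency = [c for c in clusters if passes_cutoff(c)]
```

Passing `min_reads=10` to the same function gives the relaxed
("low stringency") variant of the filter.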

Figure 1. Genomic distribution of L1 insertions. Outer rings show the
density of detected insertion sites for reference (gray) and nonreference
(black) L1s. The approximate locations of the 72 PCR-validated somatic
insertions are indicated by dots inside the circle. Note that 69 of the 72
insertions were successfully sequenced.

Our primary interest in generating these data was in finding insertions
present either in a cancer pooled library or in a normal pooled library
and not present in the corresponding paired normal or tumor library.
Turning to these, with the same stringency cutoffs as above, we found 35
putative insertions only in colo1/tumor, four only in colo1/normal, 50
predictions only in colo2/tumor, and eight only in colo2/normal.
Decreasing the requirements for predicted insertions to 10 reads
spanning at least 100 bp ("low stringency"), we found 69 potential
insertions only in colo1/tumor, 173 only in colo1/normal, 75 only in
colo2/tumor, and 42 only in colo2/normal. The dramatic increase in
predictions for colo1/normal only with decreasing stringency is an
effect of the higher coverage in colo1/normal versus colo1/tumor
(Supplemental Table S1), as two lanes of sequence were generated for
colo1/normal.
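
Finding predictions present in one pooled library but absent from its
paired library amounts to a set difference between predicted insertion
sites. A minimal sketch follows, assuming the sites have already been
reduced to comparable coordinates; the `(chromosome, position)` tuples,
the position binning used to tolerate small coordinate differences, and
the example values are all hypothetical and are not taken from the study.

```python
def site_key(chrom: str, pos: int, bin_size: int = 500) -> tuple:
    """Collapse nearby coordinates into one key (the bin size is an assumption)."""
    return (chrom, pos // bin_size)


def pool_specific(pool_a, pool_b, bin_size: int = 500):
    """Return sites predicted in pool_a whose bin has no prediction in pool_b."""
    keys_b = {site_key(c, p, bin_size) for c, p in pool_b}
    return [(c, p) for c, p in pool_a
            if site_key(c, p, bin_size) not in keys_b]


# Made-up predicted insertion sites for one tumor pool and its normal pool.
tumor_pool = [("chr1", 10_020), ("chr3", 55_400), ("chr7", 120_900)]
normal_pool = [("chr1", 10_050), ("chr9", 77_000)]

tumor_only = pool_specific(tumor_pool, normal_pool)
normal_only = pool_specific(normal_pool, tumor_pool)
```

Fixed bins are the simplest possible matching rule; two predictions that
straddle a bin boundary would be treated as different sites, so a real
comparison would more likely merge overlapping intervals.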

Five of the L1-sequenced colorectal tissue pairs were barcoded, pooled,
and analyzed by shallow, multiplexed RC-seq (10 libraries, ~75 million
paired-end Illumina GAIIx reads). A total of 26,903 nonreference genomic
insertions were detected by at least one read (Supplemental Table S2).
Of these, 358 (1) were found in only one donor, (2) were not identified
in RC-seq previously performed on pooled blood (Baillie et al. 2011) or
in databases of retrotransposon polymorphisms (Huang et al. 2010; Iskow
et al. 2010; Ewing and Kazazian 2011), and (3) could be annotated with
high confidence due to detection by multiple unique amplicons. Of this
set, 96 were only found in tumor, including eight L1, 83 Alu, and five
SVA. A total of 39 insertions were found only in nontumor samples and
223 were found in both tumor and nontumor. The tumor:nontumor ratios for
L1, Alu, and SVA were ~8:1, 2.5:1, and 2:1, respectively. AluY and
L1-Ta/pre-Ta were detected, but no HERVs were detected.



and were able to validate nine. Thus, we PCR-validated before sequencing
the 3' ends of 72 of 107 putative insertions (Supplemental Table S3;
Supplemental Text S1).

Interestingly, among 12 high-stringency putative insertions from normal
colon of the combined colo1 and colo2 data sets, none could be
validated. Possible explanations for false positives in the L1-seq data
include PCR artifacts arising during library preparation, suboptimal PCR
conditions used for validation, or L1 insertion into repetitive
sequences, refractory to successful primer design.

Our stepwise PCR amplification scheme continued by retrieving the 5'
junctions of tumor-specific L1 insertion events. Several empty-site PCRs
had already yielded a higher molecular weight band exclusively in the
tumor, which in each case was verified to be a highly truncated L1
element (Fig. 2B). In the remaining cases, long-range PCR and a PCR
specifically designed to amplify the 5' end of a full-length L1 were
used to retrieve the 5' junction.

Altogether, out of 72 cases where the insertions were PCR-validated to
be tumor specific, we successfully sequenced either the 3' or the 5'
junction in 69 cases (Supplemental Table S3; Supplemental Text S1). For
35 insertions, we sequenced both junctions, enabling us to characterize
TSDs and L1 insertion size in cancer tissue (Table 1). Surprisingly, all
of the tumor-specific insertions were highly truncated, the mean L1
insertion size being 585 bp, excluding the poly(A) tail (Table 1).

Figure 2. PCR validation scheme of L1-seq results. (A) The three-step PCR
validation scheme and location of primers used. Triangles symbolize TSD.
(B) PCR validation of the 3' junction (ins. 7). This insertion is in
tumor 1 of the eight DNA samples that had been pooled for Illumina
sequence analysis (left), while the right panel shows it is present
exclusively in the tumor, but not in the normal colon. The higher
molecular weight band visible above the ins. 7 empty site PCR product in
the tumor is a highly truncated L1. (T) Tumor; (N) normal colon; (FS)
filled site PCR product; (ES) empty site PCR product.

Using a PCR designed to amplify full-length L1 insertions, we failed to
amplify the 5' end of any of the remaining tumor-specific insertions
where 5' junctions could not be identified with the previous PCR
approaches. On the other hand, three of 10 germline polymorphic
insertions had an intact 5' end, in agreement with 30% of reference L1Hs
elements being full length (Pavlicek et al. 2002). This difference
between full-length L1 insertions in tumors (~0%) versus full-length
insertions among polymorphic germline L1Hs (~30%) is statistically
significant (P = 0.016, Fisher's exact test) and is a clear departure
from what is observed from the reference genome and from heritable
nonreference insertions (Ewing and Kazazian 2010).
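
For readers unfamiliar with the test, a two-sided Fisher's exact test on
a 2x2 table of full-length versus truncated insertions can be sketched
as below. This is an illustration of the method only: the germline
counts (3 full length of 10) come from the text, but the number of tumor
insertions that entered the reported test is not stated in this section,
so `n_tumor` is an assumption and the result is not expected to
reproduce the published P value.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that x of the col1 "full-length" events fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Rows: tumor-specific, germline polymorphic.
# Columns: full length, 5' truncated.
# n_tumor = 35 is an ASSUMPTION (the insertions with both junctions
# sequenced); the text does not say which tumor count was used.
n_tumor = 35
p_value = fisher_exact_two_sided(0, n_tumor, 3, 7)
```

Changing `n_tumor` shows how strongly the P value depends on how many
tumor insertions are counted as informative for 5' end status.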

L1-seq results on five tumors were corroborated by RC-seq. Eleven
high-confidence L1Hs hits were found in these cancers by L1-seq, out of
which four were also detected by RC-seq at either the 5' or 3' junction
(ins. 5, 9, 14, and 32) (Supplemental Tables S2, S3; Supplemental Text
S1). Among eight high-confidence L1 insertions within genes detected by
RC-seq, but missed by L1-seq, one was validated as present in tumor 10,
targeting the DGKI gene (Supplemental Table S2) (Fig. 3). Of the
remaining putative tumor-specific insertions that could be
PCR-amplified, six of eight L1s, 30 of 57 Alus, and six of 11 SVAs were
present in both tumor and paired normal tissue. No other confirmed
tumor-specific insertions from the L1-seq data were found by RC-seq.

In order to determine whether L1s are frequently mobilized in other
nonmalignant somatic tissues, we performed L1-seq on genomic DNA
extracted from cerebrum, liver, and testis samples from two other
individuals (cadaver samples) who had died from arteriosclerotic
cardiovascular disease. None of the Illumina high-stringency sequence
peaks suggestive of somatic L1 insertion could be PCR-validated,
implying that the high rate of somatic L1 insertions observed in colon
cancer was specific to the malignant tissue. Thus, we have no evidence
from our data of somatic insertions in normal colon, liver, testis, or
cerebrum.

Characterization of tumor-specific insertions 

Intriguingly, the number of validated L1 insertions varied widely from
tumor to tumor with up to 17 insertions in some and none in three others
(Fig. 3). Most retrotransposition events showed hallmarks of TPRT,
namely TSD (27/35), L1 endonuclease cleavage site, the presence of L1
poly(A) tail, frequent 5' inversion (10/35), and in one case, a 3'
transduction (Table 1). However, we note that a substantial fraction of
these somatic insertions (8/35) lacked a TSD and six of these lacked a
discernible endonuclease cleavage site, suggesting that they were
endonuclease-independent insertions (Morrish et al. 2002). Two
insertions ("3" and "21") contained 3' sequence from other chromosomes,
but lacked a poly(A) tail in between the two sequences that would
indicate a 3' transduction. Thus, they are likely cancer-associated
recombination events (Supplemental Table S3; Supplemental Text S1).
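
The reasoning used to sort insertions by their junction features can be
summarized in a few lines. This is a deliberately simplified sketch: the
`Insertion` record, its field names, the category labels, and the example
entries are hypothetical, and a real classification would also weigh
features such as 5' inversion and 3' transduction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insertion:
    """Hypothetical summary of what the sequenced junctions show."""
    name: str
    has_tsd: bool      # target-site duplication found
    has_en_site: bool  # discernible L1 endonuclease cleavage site
    has_polya: bool    # poly(A) tail at the 3' end


def classify(ins: Insertion) -> str:
    """Assign a coarse mechanism label from the junction hallmarks."""
    if ins.has_tsd and ins.has_en_site and ins.has_polya:
        return "TPRT-like"
    if not ins.has_tsd and not ins.has_en_site:
        return "possibly endonuclease-independent"
    return "ambiguous"


# Made-up examples, not entries from Table 1.
examples = [
    Insertion("example_a", True, True, True),
    Insertion("example_b", False, False, True),
    Insertion("example_c", False, True, True),
]
labels = {ins.name: classify(ins) for ins in examples}
```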

Numerous genes that are represented in the COSMIC database (Catalogue of
Somatic Mutations in Cancer,
http://www.sanger.ac.uk/genetics/CGP/cosmic/) were targets for
insertional mutagenesis by L1s in colon tumors. Examples include PTPRM
(protein tyrosine phosphatase, receptor type, M), ODZ3 (odd Oz/ten-m
homolog 3), ROBO2 (roundabout, axon guidance receptor, homolog 2), PCM1
(pericentriolar material 1), and CDH11 (cadherin-11). PCM1 and CDH11 are
also represented in Sanger's Cancer Gene Census
(http://www.sanger.ac.uk/genetics/CGP/Census/). Interestingly, according
to COSMIC, in the large intestine these genes were mutated with the
following high frequencies: PTPRM (50%), ODZ3 (100%), ROBO2 (15%), PCM1
(12%), CDH11 (52%). All of our hits were intronic and were PCR validated
as well as sequenced. Additional interesting genes with a potential role
in malignancy were also targeted, for instance, RUNX1T1 (runt-related
transcription factor 1), a member of the myeloid translocation genes.
Interestingly, somatic RUNX1T1 point mutations have been found in
colorectal cancers (Wood et al. 2007), and the product of the related
RUNX3 gene regulates L1 expression (Yang et al. 2003).

Cellular timing of L1 retrotransposition

In an effort to determine at what point in tumorigenesis the L1
insertions occurred, we developed three lines of evidence: analysis of
SNPs in sequence flanking the L1 insertion and the empty site, the
number of empty site X chromosome alleles in males who had an L1
insertion into the X, and the presence/absence of the L1 insertion in a
second section of a particular tumor in which an L1 insertion occurred.

First, we found three insertions (C2, C4, and insertion 31) with
flanking heterozygous SNPs. For C2 (ODZ3 gene) there was one SNP, for C4
there were four SNPs, and for insertion 31 there was one flanking SNP
(Supplemental Fig. S3A). The presence of both alleles of the particular
SNP in the empty site chromosomes is informative in case of no
aneuploidy at the respective alleles. If the insertion occurred at the
initiation of the tumor (the one-cell stage), the filled site would
contain one allele and the empty site would contain the other allele
only. If both alleles are present in the empty-site chromosomes, then
the insertion likely occurred after the one-cell stage of the tumor. The
data showed that all six SNPs near three different insertions in the
empty-site chromosomes were heterozygous, suggesting that the insertions
occurred after the initiation of tumorigenesis. We carried out array
comparative genomic hybridization (aCGH) and found no copy-number gain
or loss in the chromosomal arm of these SNPs (data not shown), although
small chromosomal aberrations at the respective alleles cannot be ruled
out.
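
The flanking-SNP argument can be written out as a small decision rule. A
minimal sketch, assuming no aneuploidy at the locus (as the text
requires) and that alleles are represented as single-letter strings; the
function name, return labels, and example alleles are hypothetical.

```python
def infer_timing(filled_site_alleles: set, empty_site_alleles: set) -> str:
    """Interpret the alleles of a heterozygous SNP flanking an insertion.

    filled_site_alleles: alleles seen on chromosomes carrying the insertion.
    empty_site_alleles:  alleles seen on chromosomes without the insertion.
    Assumes the patient is heterozygous and the locus is not aneuploid.
    """
    if len(empty_site_alleles) >= 2:
        # Both alleles still occur without the insertion, so some tumor
        # cells never acquired it: the event postdates the one-cell stage.
        return "after one-cell stage"
    if (len(empty_site_alleles) == 1
            and empty_site_alleles.isdisjoint(filled_site_alleles)):
        # Filled site carries one allele, empty site only the other one.
        return "consistent with one-cell stage"
    return "uninformative"


# Made-up alleles for illustration.
late = infer_timing({"A"}, {"A", "G"})
early = infer_timing({"A"}, {"G"})
```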

Second, we found insertions (D9, D12, and E3) into the X 
chromosome in two males. If the insertion occurred at the one-cell 
stage and the male did not have X-chromosome aneuploidy, the 
tumor should lack an empty site. However, in all cases, we found 
an empty-site band by PCR, indicating again that the insertions 
occurred after the one-cell stage of tumorigenesis (Supplemental 
Fig. S3B contains data on D9 and D12). Again, aCGH showed that 
both males had a single X chromosome. 

Third, we obtained a second portion of tissue from a number of tumors
and determined whether the insertions found in the first tumor tissues
could be confirmed in the second tumor sample. In three of seven
instances (insertions C2, D7, and E3) we were able to confirm the
insertion in a second tumor sample, suggesting a relatively early event
in tumorigenesis. Four other insertions ("31," "A10," "C4," "D12") were
present in the first tumor section, but not in the second (Supplemental
Fig. S3C). In tumor 2853, two insertions were studied: E3 was present in
both tumor portions, while D12 was not, suggesting that these two
insertions occurred at different times and in different cells of the
tumor (data not shown). Furthermore, heterozygosity for L1 flanking SNPs
in insertions 31 and C4 in the original tumor sample, as well as the
absence of the insertion from the second tumor portion, both suggest
late insertion events. Thus, from the combination of these data on a
small sample size we conclude that most, if not all, of the studied L1
insertions occurred after the initiation of the tumor. However, a
minority of the insertions may have occurred at an early stage of
tumorigenesis. Furthermore, we cannot exclude the possibility of tumor
blood vessels or infiltrating lymphocytes as contributing alternative
explanations for some of the results.

Figure 3. Distribution of somatic L1 insertions in tumors. Insertions in
black were detected by L1-seq, while the insertion in tumor 10 in white
was detected by RC-seq only.

Effects of L1 retrotransposition on measures of cellular
instability

To address the increased rate of somatic L1 retrotransposition in
tumors, we assessed the genomic landscape of the tissue samples by aCGH,
microsatellite instability (MSI), and L1 promoter methylation status. A
previous report demonstrated a correlation of genome-wide DNA
methylation status of tumors with increased de novo L1 insertions (Iskow
et al. 2010). We assessed the methylation status of the L1 promoter at
four different CpG sites. Although L1 promoter hypomethylation was found
in tumor samples compared with paired normal tissue, no correlation was
observed between L1 methylation status and the number of L1 insertions
(Fig. 4A).

For aCGH, we analyzed normal and tumor tissue from six patient samples,
two that possessed the greatest number of de novo tumor insertions (12
and 1775), two with no tumor insertions (8 and 7647), and the two males
mentioned above with both empty and filled insertion sites in X
chromosomes (17 and 2853). Tumor samples 8, 12, 17, and 1775 contained
complex chromosomal changes, including entire and partial chromosomal
gains and losses. Tumor samples 2853 and 7647 presented no detectable
aberrations relative to normal tissue. Interestingly, three of the
samples (12, 17, and 1775) with complex chromosomal rearrangements had a
high number of validated insertions. Likewise, patient 7647 with no
validated L1 insertions had no detectable chromosomal changes. The
outliers from the trend of a direct relationship between chromosomal
aberrations and L1 insertions were patients 8 and 2853. Interestingly,
these latter patients are potentially genetically predisposed to colon
cancer due to familial cancer aggregation or a very young age of
diagnosis.

In addition to analysis of gross genomic abnormalities, we assessed the
status of the mismatch DNA repair pathway by assessing microsatellite
expansions in the genome. Seven of the 16 patients were MSI positive.
Although two samples with the highest number of somatic L1 insertions
were MSI positive, MSI status did not correlate with the number of de
novo L1 insertions for each tumor (Fig. 4B).

Interestingly, a statistically significant correlation was observed
between the number of insertions and the age of the investigated
patients (P = 0.01425, R² = 0.3128, where age is the time of surgical
sample removal). Eight or more validated insertions were observed only
in the tumors of patients 78 yr old or older. An outlier in the
correlation was a 72-yr-old with no validated insertions. However, he
was the only proband with rectal cancer, but no colon tumor diagnosis.
When this patient was excluded from the analysis, as well as cases with
a presumed genetic predisposition to colon cancer (familial polyposis
case and a 17-yr-old male), an even more significant correlation was
observed between the age of sporadic colon cancer patients and L1
activity (P = 0.001548, R² = 0.578) (Fig. 4C).
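
The R² statistic quoted above is the coefficient of determination of a
simple linear fit of insertion count on age. The sketch below shows how
such a value is computed by ordinary least squares. The ages and
insertion counts are hypothetical placeholders, not the study's
per-patient data, so the output is not expected to match the reported
values; the P values in the text would additionally require a
significance test that is not reproduced here.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a single predictor, R^2 equals the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# HYPOTHETICAL ages (years) and validated insertion counts per tumor.
ages = [45, 52, 60, 66, 71, 78, 83, 88]
insertions = [0, 1, 0, 3, 2, 8, 11, 17]
slope, intercept, r2 = linear_fit(ages, insertions)
```

Dropping selected patients from both lists and refitting mirrors the
exclusion analysis described in the text.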

Discussion 

Our L1-seq method has revealed a high rate of L1Hs retrotransposition in
certain colorectal cancer genomes. The neighboring matched normal colon
sample in these 16 cases, as well as cerebrum, liver, and testis from
two other individuals, yielded no L1 insertions that could be validated,
indicating few or no retrotransposition events in these normal tissues.

Iskow et al. (2010) used 454 pyrosequencing to search for de novo L1
insertions in five glioblastomas, five medulloblastomas, as well as
leukemia and breast cancer cell lines, but they found no insertions in
these cases. However, they identified nine somatic L1 insertions in six
of 20 lung tumors. Since TSDs were not reported, the question of whether
L1 integration in lung cancer occurs through TPRT remained open.

Here we report that evolutionarily young L1Hs retrotransposons can
mobilize themselves through the classical TPRT mechanism in colon cancer
genomes at a high frequency. The true retrotransposition rate is likely
to be even higher, as our method does not detect insertions mobilized by
L1 elements in trans, such as Alus, SVAs, most inactive L1s, and
processed pseudogenes. In addition, there are likely other L1 insertions
in our data set that have not been subjected to validation. Longer
tumor-specific 3' transductions will be missed as well, as it is
difficult to differentiate between the progenitor and the transduced
sequence by their 3' flank with L1-seq.

Figure 4. Analysis of factors influencing L1 activity. (A) L1 CpG
promoter methylation status performed by quantitative bisulfite PCR
analysis. (N) Normal tissue; (T) tumor tissue; (*) MSI. Replicates of
four were done for each data point. (Error bars) Standard deviations.
(B) MSI analysis. 6% TBE gel depicting the status of five microsatellite
repeats (BAT25, BAT26, D2S123, D17S250, D17346 in descending order) in
normal and tumor tissue from two different patients. Tumor tissue "6"
contained additional bands and gel shifts compared with the normal
tissue, indicating MSI. Samples from "8" demonstrated no differences
suggestive of MSI. (C) Correlation of L1 activity with age of the
patient at time of surgery. See text for details.



In order to determine why the rate of somatic cell retrotransposition
was high in some tumors compared with the normal tissue, we assessed how
the following factors correlate with retrotransposition rate: age of
colorectal cancer patients, chromosomal aberrations, L1 methylation
status, and mismatch DNA repair as reflected by MSI. A clear correlation
of retrotransposition activity was observed with the age of colon cancer
patients in sporadic cases. Furthermore, we also analyzed genetic
instability by aCGH in six samples and found a modest association
between the number of chromosomal aberrations and the age of the
sporadic colon cancer patients. Altogether, it is possible that the
hypomethylated microenvironment of the tumors, together with genetic
instability as reflected by MSI and gross chromosomal changes, have a
cumulative effect in older patients. Our results are in agreement with
the correlation of retrotransposition activity with genome instability
during yeast chronological aging (Maxwell et al. 2011).

In this study, we found that genes with a known driver function in
cancer are mutagenized by L1Hs elements. The accumulation of
retrotransposon sequences is predicted to cause further genetic
instability through recombination. Thus, an elevated insertion rate is
expected to contribute to tumor evolution. As exemplified by a somatic
L1 insertion into the APC gene (Miki et al. 1992), it is clear that in
some fraction of colorectal cancers retrotransposon insertions can be
etiologically significant. Yet, it remains unclear in what fraction of
cases retrotransposons initiate malignant transformation and in how many
instances they contribute solely to a more aggressive phenotype. SNP, X
chromosome, and secondary sampling data from tumor samples suggest that
L1 insertions likely occurred at various times after the initiation of
the tumor. Although we found no evidence for L1 insertion into colon
cancer tumor-suppressor genes or oncogenes that would be indicative of
driver mutation-induced tumor clonality, analysis of a larger number of
tumors or a deeper sequencing of retrotransposon insertions could
uncover such events. We propose that it is possible to estimate
insertion timing and tumor heterogeneity more precisely by evaluating
pure tumor samples. Our findings are in agreement with a very recent
report on retrotransposon insertions in epithelial cancers (Lee et al.
2012). Intriguingly, an intronic L1 integration event was found in that
study as well in the ROBO2 gene in a colon tumor. Additionally, they
detected intronic L1 insertions in CDH12, while we characterized an
insertion into the CDH11 gene. Thus, the role of cell-adhesion genes in
retrotransposon insertion-mediated colorectal tumorigenesis may deserve
further investigation.

An unexpected finding of our PCR-based validation is the severely
truncated nature of all validated L1 insertions in colon cancer. It was
not possible to assess whether this is a general characteristic of the
malignant phenotype, of all somatic tissues, or of the gastrointestinal
tract in particular, as no de novo L1 insertions could be uncovered from
normal colon, liver, testis, and brain. In a transgenic mouse model, 30
of 33 somatic L1 insertions were 5' truncated (Babushok et al. 2006),
raising the possibility of a gradual decrease in L1 size from germline
to somatic to malignant insertions. In cultured HeLa cells, 94/100
insertions were 5' truncated (Gilbert et al. 2005). This might indicate
that some cancer tissues or cultured cells could allow full-length
insertions to accumulate. The overexpression of an exogenous L1 element,
coupled with the bias toward recovering larger inserts in that assay,
and the unknown effects of cell culture conditions on retrotransposition
complicate transferring conclusions on the L1 5' truncation rate to
cancer tissue. Likewise, it is not understood why the majority of
germline L1 insertions are 5' truncated as opposed to Alu and SVA
insertions that are mostly full length, yet also mobilized by L1s
(Hancks et al. 2011).

The truncated structure of L1 elements in colorectal cancer
may be useful in understanding the mechanism of 5' truncation 
both in normal and tumor cells. We propose two possible expla- 
nations: (1) If TPRT timing is coupled to the cell cycle, the elevated 
cell division rate of malignant cells may not leave sufficient time to 
complete integration of long mobile elements; (2) a DNA repair 
pathway might monitor and remove de novo mobile element 
insertions in healthy tissues. Once this presumed surveillance 
pathway is down-regulated in cancer, retrotransposon insertions 
are not removed efficiently and are allowed to accumulate. At the 
same time, another or the same DNA repair pathway might spe- 
cialize in truncating fresh integrants or prohibiting them from 
completing retrotransposition. If this process is up-regulated, the
truncation rate increases. The efficiency of such a pathway might
correlate with insertion size or be sequence specific, thus preferentially
targeting L1 elements over Alus and SVAs. Interestingly,
nonhomologous end joining (NHEJ) is an important DNA repair
pathway candidate with a reported conflicting dual role in regulating
retrotransposon insertions, offering an explanation for
parallel L1 up-regulation and truncation (Suzuki et al. 2009). We
propose that by comparing the genome or transcriptome of tumors
with a high rate of retrotransposition to their paired normal tissues,
we may discern clues to cellular factors causing L1 mobilization
and 5' truncation.

To conclude, the cancerous colon of many patients is the
second reported organ beside the brain (Baillie et al. 2011) in which
a high rate of retrotransposition occurs. Lung, prostate, and ovarian
tumors are also reported to allow a lower level of L1 mobilization
(Iskow et al. 2010; Lee et al. 2012), but many other cancer
types appear to be nonpermissive for a detectable rate of retrotransposition.
All L1 insertions in the colorectal tumors of this
study were highly truncated, potentially indicating the footprint
of a defective or hyperactive DNA repair pathway in cancer. The
cause and effect of retrotransposon mobilization in cancers warrant
further investigation.

Methods 

Human DNA samples 

DNA was extracted from human patient tissue samples acquired 
from the University of Minnesota Tissue Procurement Facility 
from BioNet (IRB#0805E32181). See Supplemental Table S4 for 
patient data. Briefly, 2 mg of tissue was digested overnight at 55°C
on a rotating platform in 710 µL of digest buffer (1 M Tris at pH
8.0, 1 mM EDTA, 1x SSC, 1% SDS, 1 Mm NaCl, 10 µg/mL Proteinase K).
Following digestion, DNA was purified using a
phenol-chloroform-isoamyl alcohol (Life Sciences) isolation protocol.

Frozen human tissue from two Caucasian cadavers with
arteriosclerotic cardiovascular disease was obtained from the
NICHD Brain and Tissue Bank for Developmental Disorders at the
University of Maryland, Baltimore. DNA isolation was done utilizing
the AllPrep DNA/RNA Mini Kit (Qiagen).

Library construction, sequencing, and analysis 
L1-seq

The library for L1Hs elements was made according to Ewing and
Kazazian (2010), while the library for L1, Alu, and SVA elements
(RC-seq) was constructed according to Baillie et al. (2011). L1Hs
elements were TOPO-TA cloned (Invitrogen) and Sanger-sequenced
for quality control of the library preparation, and were
subsequently sequenced on an Illumina HiSeq 2000 at the Johns
Hopkins University Genetic Resources Core Facility High Throughput
Sequencing Center.

Pooled L1-seq library sequence data were analyzed as described
previously (Ewing and Kazazian 2010), and compared against insertion
sites of known reference and nonreference transposable
elements (Beck et al. 2010; Ewing and Kazazian 2010, 2011; Huang
et al. 2010; Iskow et al. 2010; Witherspoon et al. 2010; Hormozdiari
et al. 2011; Stewart et al. 2011). Gene annotations were obtained
from UCSC Known Genes (Hsu et al. 2006).
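
The comparison against known insertion sites can be pictured as a positional lookup. The following is a minimal sketch only, not the pipeline of Ewing and Kazazian (2010): the 100-bp matching window, the function names, and the (chromosome, position) representation are all assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict

def build_index(known_sites):
    """Group known insertion positions by chromosome and sort them."""
    index = defaultdict(list)
    for chrom, pos in known_sites:
        index[chrom].append(pos)
    for positions in index.values():
        positions.sort()
    return index

def is_known(index, chrom, pos, window=100):
    """True if a known site lies within `window` bp of the candidate.

    The window size is a hypothetical choice, not a value from the text.
    """
    positions = index.get(chrom, [])
    # First known position that is >= pos - window
    i = bisect_left(positions, pos - window)
    return i < len(positions) and positions[i] <= pos + window

# Hypothetical coordinates, for illustration only
known = build_index([("chr1", 1000), ("chr2", 500)])
candidates = [("chr1", 1050), ("chr1", 1200), ("chr3", 10)]
novel = [c for c in candidates if not is_known(known, *c)]
```

Candidates that survive such a filter would be treated as putative novel insertions and passed on to validation.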

RC-seq 

The library for retrotransposon capture and sequencing was created
utilizing DNA from five pairs of colorectal and normal tissue
(samples 1; 4; 6; 8; 10) by the same method as published previously
(Baillie et al. 2011). One significant change was made to the
technique, in that liquid phase hybridization was performed as
opposed to solid surface, chip-based hybridization. The libraries
were then sequenced on an Illumina Genome Analyzer II and
aligned to the genome by a computational pipeline that utilized
SOAP2 and otherwise followed much the same
method as published (Baillie et al. 2011; R Shukla, KR Upton,
M Munoz-Lopez, DJ Gerhardt, JK Baillie, ME Fisher, PM Brennan,
A Collino, S Ghisletti, S Sinha, et al., in prep.).

PCR validation of the Illumina results 

A three-step PCR validation protocol was used to validate the next-generation
sequencing reads and to retrieve 3' and 5' junctions. As
the first step, L1 3' ends together with flanking genomic regions
were amplified using the same AC dinucleotide-specific primer of
L1Hs as used for Illumina sequencing (L1Hs primer:
GGGAGATATACCTAATGCTAGATGACAC) and a primer selected from the 3'
flanking region based on the reference genome sequence (FS
primer). PCR reactions were carried out in 12.5 µL of 2x GoTaq
Green master mix (Promega) in a total volume of 25 µL, with 0.8
µL of FS primer, 1.5 µL of L1Hs primer, and 25 ng of DNA to amplify
the filled site. The empty site was amplified under the same
conditions, except that 1.5 µL of FS primer, 1.5 µL of ES primer,
and 12.5 ng of DNA were used. Primers were 20 pmol/µL and their
locations are depicted in Figure 2A. Reactions were incubated for 2
min at 95°C, followed by 30 cycles of 30 sec at 95°C, 30 sec at 57°C,
and 1.5 min at 72°C, followed by a final extension of 5 min at 72°C
on a PTC-200 Peltier Thermal Cycler. Long-range PCR to recover
longer L1 insertions was performed with the Expand Long Template
PCR System (Roche) according to the manufacturer's instructions
in buffer 1, with 1 µL each of 20 µM FS and ES primers, and 25 ng of
tumor DNA. 5' junctions were PCR amplified using the same
conditions as for the 3' junction, except that a primer hybridizing
to the L1 5'UTR was used (L1nt112out: GATGAACCCGGTACCTCAGA)
together with the respective ES primer, and primer extension
time was only 45 sec. FS and ES primer sequences are included
in the Supplemental Material (Supplemental Table S1). PCR
products were cut out of the gel, extracted with the QIAquick Gel
Extraction Kit (Qiagen), and sequenced. See Supplemental Text S1
for Sanger sequence data on insertions.
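
As a quick check on the reaction setup, the final primer concentrations follow from the stock concentration and pipetted volumes. The sketch below assumes the primer stocks are 20 pmol/µL (numerically equal to 20 µM) and the total reaction volume is 25 µL, as read from the text; the function name is illustrative.

```python
def final_primer_conc_uM(stock_pmol_per_uL, primer_vol_uL, total_vol_uL):
    """Final primer concentration in µM.

    1 pmol/µL equals 1 µM, so the dilution is a simple volume ratio.
    """
    return stock_pmol_per_uL * primer_vol_uL / total_vol_uL

# Filled-site reaction: 0.8 µL FS primer and 1.5 µL L1Hs primer in 25 µL
fs_conc = final_primer_conc_uM(20, 0.8, 25)    # about 0.64 µM
l1hs_conc = final_primer_conc_uM(20, 1.5, 25)  # about 1.2 µM

# Empty-site reaction: 1.5 µL each of FS and ES primer in 25 µL
es_conc = final_primer_conc_uM(20, 1.5, 25)    # about 1.2 µM
```

Under these assumptions the filled-site reaction is asymmetric, with roughly twice as much L1Hs primer as FS primer, while the empty-site reaction uses equal amounts of both primers.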

Microsatellite instability assays 

To assess the MSI status, we utilized five markers recommended 
by the National Cancer Institute (Bethesda markers): BAT25 and 
BAT26 to assess mononucleotide repeats (A)n and D2S123, 
D5S346, and D17S250 to assess dinucleotide repeats (CA)n. MSI 
status was determined using previously established protocols 
(Ashktorab et al. 2003; Muller et al. 2004). Primers were developed 
by the NCI for screening patients in the clinic. 
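
The five-marker panel lends itself to a simple tally. The sketch below assumes the widely used convention for the Bethesda panel (two or more unstable markers: MSI-high; one: MSI-low; none: microsatellite stable). The text does not state which thresholds were applied, and the function and variable names are illustrative.

```python
BETHESDA_MARKERS = ("BAT25", "BAT26", "D2S123", "D5S346", "D17S250")

def classify_msi(unstable_markers):
    """Classify MSI status from the set of markers scored as unstable.

    Thresholds follow the common five-marker convention and are an
    assumption here, not a detail taken from the study.
    """
    unstable = set(unstable_markers)
    unknown = unstable - set(BETHESDA_MARKERS)
    if unknown:
        raise ValueError(f"not a Bethesda panel marker: {sorted(unknown)}")
    if len(unstable) >= 2:
        return "MSI-H"
    if len(unstable) == 1:
        return "MSI-L"
    return "MSS"

# Hypothetical tumor with both mononucleotide markers unstable
status = classify_msi({"BAT25", "BAT26"})
```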

L1 methylation status

The methylation level of L1 promoters was determined according to
Wilhelm et al. (2010). Briefly, each sample was amplified three
times and each amplification was pyrosequenced once. The average
of the three was utilized to determine the value of CpG
methylation for each of the four positions analyzed for an L1.
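
The averaging step can be written out explicitly. This is a minimal sketch assuming each of the three amplifications yields one methylation percentage per CpG position; the numbers are invented for illustration and are not data from the study.

```python
from statistics import mean

def mean_cpg_methylation(replicates, n_positions=4):
    """Average methylation per CpG position across three amplifications.

    `replicates` holds three sequences, one per amplification, each with
    one methylation percentage per analyzed CpG position.
    """
    if len(replicates) != 3:
        raise ValueError("expected three amplifications per sample")
    if any(len(r) != n_positions for r in replicates):
        raise ValueError(f"expected {n_positions} CpG positions per run")
    # zip(*replicates) regroups the values by CpG position
    return [mean(position) for position in zip(*replicates)]

# Hypothetical percentages for one sample
runs = [
    [60, 70, 80, 50],
    [62, 68, 82, 50],
    [64, 72, 78, 50],
]
per_position = mean_cpg_methylation(runs)
```

With these invented values the per-position averages are 62, 70, 80, and 50 percent.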

aCGH 

DNA from patients 8, 12, 17, 1775, 2853, and 7647 was restriction
digested and labeled with fluorochrome Cyanine-5 using random
primers and exo-Klenow fragment DNA polymerase. DNA from
a sex-matched control was labeled concurrently with Cyanine-3.
The sample and control DNA were combined, and array-based
comparative genomic hybridization (aCGH) and single nucleotide
polymorphism (SNP) analysis were performed with a 180K Cancer
CGH+SNP microarray constructed by Agilent Technologies, Inc.
that contains ~115,000 distinct biological oligonucleotides and
55,000 SNP sites, spaced at an average interval of 25 kb (for 20,000
cancer-associated CGH probes: one probe/0.5-1 kb). The ratio of
sample to control DNA for each oligo was calculated using Feature
Extraction software 10.10 (Agilent Technologies). The abnormal
threshold was applied using Cytogenomics 2.060 (Agilent Technologies).
A combination of several statistical algorithms was applied.
A minimum of three oligos that have a minimum absolute
ratio value of 0.1 (based on a log2 ratio) is required for reporting
of a copy-number loss or gain. Analysis was performed using Human
Genome Build 19 (Feb 2009) as the reference.
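
The reporting rule (at least three oligos, each with an absolute log2 ratio of at least 0.1) can be illustrated with a simple run-length scan. This is a deliberately simplified sketch and not the Agilent software's algorithm, which combines several statistical methods. It assumes the qualifying oligos must be consecutive and share the same sign, and the function name is hypothetical.

```python
from itertools import groupby

def call_segments(log2_ratios, min_abs=0.1, min_probes=3):
    """Return (start_index, end_index, state) for qualifying probe runs."""
    def state(ratio):
        if ratio >= min_abs:
            return "gain"
        if ratio <= -min_abs:
            return "loss"
        return None

    calls = []
    start = 0
    # groupby collapses consecutive probes that share the same state
    for st, group in groupby(log2_ratios, key=state):
        n = len(list(group))
        if st is not None and n >= min_probes:
            calls.append((start, start + n - 1, st))
        start += n
    return calls

# Hypothetical log2(sample/control) ratios along one chromosome
ratios = [0.0, 0.2, 0.3, 0.15, 0.05, -0.2, -0.3, 0.0, -0.5, -0.4, -0.1]
segments = call_segments(ratios)
```

In this invented example the scan reports a gain over probes 1 to 3 and a loss over probes 8 to 10; the two-probe loss at positions 5 and 6 falls below the three-oligo minimum and is not reported.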

Data access 

The sequence and phenotypic data from this study have been
deposited in dbGaP (http://www.ncbi.nlm.nih.gov/dbgap) under
accession number phs000536.v1.p1.